%pip install PAPyA==0.1.0
Configurations for SP2Bench Data
config_sp2bench = "settings.yaml" # config file location
logs_sp2bench = "log" # logs file location
Configurations for Watdiv Data
config_watdiv = "settings_watdiv.yaml" # config file location
logs_watdiv = "log_watdiv" # logs file location
The configuration file is written in YAML and has two main parts: the dimensions and the number of queries in the experiment. You can add more dimensions here or change the existing ones to anything you need.
Example :
dimensions:
schemas: ["st", "vt", "pt", "extvt", "wpt"]
partition: ["horizontal", "predicate", "subject"]
storage: ["csv", "avro", "parquet", "orc"]
query: 11
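The settings file can be inspected with any YAML parser. Below is a minimal sketch using the third-party PyYAML package (an assumption; PAPyA itself reads the file from `config_path`), with the example configuration embedded as a string so the snippet is self-contained:

```python
import yaml  # third-party PyYAML package

# The example configuration from above, embedded as a string for illustration.
settings = """
dimensions:
  schemas: ["st", "vt", "pt", "extvt", "wpt"]
  partition: ["horizontal", "predicate", "subject"]
  storage: ["csv", "avro", "parquet", "orc"]
query: 11
"""

config = yaml.safe_load(settings)
print(config["dimensions"]["schemas"])  # ['st', 'vt', 'pt', 'extvt', 'wpt']
print(config["query"])                  # 11
```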
The structure of the log files must follow the order of the dimensions in the configuration file (i.e. {schemas}.{partition}.{storage}.txt), and the subfolders should be the ranking sets of the experiment (e.g. dataset sizes).
Example :
UI Module
└───log
    ├───100M
    │       st.horizontal.csv.txt
    │       st.horizontal.avro.txt
    │       ...
    └───250M
            st.horizontal.csv.txt
            st.horizontal.avro.txt
            ...
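Since the log file names follow the dimension order, the full set of expected files is simply the Cartesian product of the dimension options. A minimal sketch in plain Python (independent of PAPyA):

```python
from itertools import product

# Dimension options from the example configuration, in configuration-file order.
dimensions = {
    "schemas": ["st", "vt", "pt", "extvt", "wpt"],
    "partition": ["horizontal", "predicate", "subject"],
    "storage": ["csv", "avro", "parquet", "orc"],
}

# One log file per configuration: {schemas}.{partition}.{storage}.txt
expected_files = [
    ".".join(combo) + ".txt" for combo in product(*dimensions.values())
]

print(len(expected_files))   # 5 * 3 * 4 = 60 configurations
print(expected_files[0])     # st.horizontal.csv.txt
```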
SDRank is a class in the PAPyA library that calculates a ranking score R for each dimension independently, operating over the log-based structure that the user specified in the configuration file.
The value of R represents the performance of a particular configuration (a higher value means a better-performing configuration). The ranking function R below is used to calculate the rank scores:

$R = \frac{1}{|Q|} \sum_{r=1}^{d} \frac{d-r}{d-1}\, O_{r}$

$d$ : total number of parameters (options) in a particular dimension
$O_{r}$ : number of occurrences of a configuration placed at rank $r$ (Rank 1, Rank 2, Rank 3, ...)
$|Q|$ : total number of queries
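The ranking function can be reproduced in a few lines of plain Python. This is a sketch for intuition, not PAPyA's implementation; here $d$ is taken as the number of rank positions in the occurrence list, which reproduces the Result values shown in the calculateRank tables below:

```python
def rank_score(occurrences, num_queries):
    """Ranking score R: rank-weighted occurrences, normalized to [0, 1].

    occurrences[r-1] is the number of queries for which the configuration
    placed at rank r; d is the number of rank positions considered.
    """
    d = len(occurrences)
    return sum(
        (d - r) / (d - 1) * o for r, o in enumerate(occurrences, start=1)
    ) / num_queries

# pt.horizontal.parquet at 100M: 15 first places, 2 second places, 0 third
# places over |Q| = 17 queries (20 queries minus the 3 excluded ones).
print(round(rank_score([15, 2, 0], 17), 6))  # 0.941176
```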
Parameters:
config_path : str
Specify the path to your configuration file(s). i.e. ./UIModule/settings_watdiv.yaml
log_path : str
Specify the path to your log file(s). i.e. ./UI Module/log_watdiv
ranking_sets : str
Ranking sets of user choice. i.e. dataset sizes (100M)
dimension : str
A single dimension to be ranked. i.e. schemas
# this class takes a single dimension and a dataset size (ranking set) as parameters, as specified in the user's log folder structure
from Rank import SDRank
schemaSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'schemas')
schemaSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'schemas')
schemaSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'schemas')
partitionSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'partition')
partitionSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'partition')
partitionSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'partition')
storageSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'storage')
storageSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'storage')
storageSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'storage')
This method automates calculating the rank scores of a single dimension using the ranking function above.
It returns a table of configurations sorted by their ranking score, best-performing first, along with the number of occurrences of each configuration placed at rank _r_ (1st, 2nd, 3rd, ...).
Parameters:
*args : str or list
This method takes an arbitrary number of string and list parameters.
str -> slice the table according to the string input. i.e. "predicate" will slice the table by the predicate partitioning
list -> remove some queries from the ranking calculation. i.e. [7, 8, 9] will remove queries 7, 8, and 9 from the calculation
# exclude queries 1, 2, and 3 from the ranking calculation
exclude_query = [1,2,3]
schemaSDRank_100M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| pt.horizontal.parquet | 15.0 | 2.0 | 0.0 | 0.941176 |
| pt.subject.parquet | 13.0 | 3.0 | 1.0 | 0.852941 |
| pt.horizontal.orc | 13.0 | 2.0 | 2.0 | 0.823529 |
| st.predicate.orc | 9.0 | 8.0 | 0.0 | 0.764706 |
| pt.subject.orc | 9.0 | 7.0 | 1.0 | 0.735294 |
schemaSDRank_250M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| pt.horizontal.parquet | 16.0 | 1.0 | 0.0 | 0.970588 |
| pt.horizontal.orc | 14.0 | 3.0 | 0.0 | 0.911765 |
| pt.subject.parquet | 14.0 | 3.0 | 0.0 | 0.911765 |
| pt.subject.orc | 12.0 | 5.0 | 0.0 | 0.852941 |
| st.predicate.orc | 10.0 | 7.0 | 0.0 | 0.794118 |
schemaSDRank_500M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| pt.subject.parquet | 15.0 | 2.0 | 0.0 | 0.941176 |
| pt.horizontal.parquet | 14.0 | 3.0 | 0.0 | 0.911765 |
| pt.horizontal.orc | 13.0 | 4.0 | 0.0 | 0.882353 |
| vp.predicate.orc | 12.0 | 5.0 | 0.0 | 0.852941 |
| pt.subject.orc | 12.0 | 4.0 | 1.0 | 0.823529 |
partitionSDRank_100M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| vp.predicate.orc | 16.0 | 1.0 | 0.0 | 0.970588 |
| vp.predicate.parquet | 15.0 | 1.0 | 1.0 | 0.911765 |
| pt.horizontal.orc | 11.0 | 5.0 | 1.0 | 0.794118 |
| st.subject.orc | 10.0 | 7.0 | 0.0 | 0.794118 |
| st.subject.parquet | 9.0 | 8.0 | 0.0 | 0.764706 |
partitionSDRank_250M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| pt.horizontal.parquet | 12.0 | 5.0 | 0.0 | 0.852941 |
| st.predicate.parquet | 13.0 | 2.0 | 2.0 | 0.823529 |
| st.subject.orc | 10.0 | 7.0 | 0.0 | 0.794118 |
| pt.horizontal.orc | 9.0 | 8.0 | 0.0 | 0.764706 |
| pt.subject.orc | 8.0 | 8.0 | 1.0 | 0.705882 |
partitionSDRank_500M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Rank 3 | Result |
|---|---|---|---|---|
| st.predicate.parquet | 16.0 | 1.0 | 0.0 | 0.970588 |
| pt.subject.parquet | 13.0 | 4.0 | 0.0 | 0.882353 |
| pt.subject.orc | 13.0 | 4.0 | 0.0 | 0.882353 |
| vp.horizontal.parquet | 12.0 | 5.0 | 0.0 | 0.852941 |
| st.predicate.orc | 10.0 | 7.0 | 0.0 | 0.794118 |
storageSDRank_100M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Result |
|---|---|---|---|
| st.subject.orc | 17.0 | 0.0 | 1.000000 |
| st.horizontal.orc | 16.0 | 1.0 | 0.941176 |
| st.predicate.orc | 12.0 | 5.0 | 0.705882 |
| vp.subject.orc | 11.0 | 6.0 | 0.647059 |
| pt.predicate.parquet | 10.0 | 7.0 | 0.588235 |
storageSDRank_250M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Result |
|---|---|---|---|
| st.subject.orc | 17.0 | 0.0 | 1.000000 |
| st.horizontal.orc | 17.0 | 0.0 | 1.000000 |
| vp.subject.orc | 13.0 | 4.0 | 0.764706 |
| vp.predicate.orc | 12.0 | 5.0 | 0.705882 |
| pt.horizontal.parquet | 12.0 | 5.0 | 0.705882 |
storageSDRank_500M.calculateRank(exclude_query).head()
| | Rank 1 | Rank 2 | Result |
|---|---|---|---|
| st.horizontal.orc | 17.0 | 0.0 | 1.000000 |
| st.subject.orc | 17.0 | 0.0 | 1.000000 |
| vp.predicate.orc | 15.0 | 2.0 | 0.882353 |
| pt.horizontal.orc | 14.0 | 3.0 | 0.823529 |
| vp.subject.orc | 14.0 | 3.0 | 0.823529 |
Ranking over one dimension is insufficient when multiple dimensions are involved. The presence of trade-offs reduces the accuracy of single-dimension ranking functions, as can be seen in the radar plots below.
This method returns a radar chart that exposes these trade-offs: optimizing by a single-dimension ranking criterion reduces the accuracy in the other dimensions.
# This example shows that the top configuration of ranking by schema is optimized towards its own dimension only, ignoring the other two dimensions.
from Rank import SDRank
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotRadar()
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotRadar()
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotRadar()
In addition to the radar plot, PAPyA also provides a visualization that shows the performance of a chosen single dimension's parameters in terms of their rank scores.
This method returns a bar chart that shows a particular dimension's rankings, pivoting another dimension by the user's chosen viewing projection.
Parameters:
view : str
This method takes a string of the dimension option that the user chooses to view as the projection.
Example:
_Schemas_ single dimensional ranking viewed from _Predicate Partitioning_ by pivoting over the _Storage_ dimension
from Rank import SDRank
# example of schema dimension plots
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('subject')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('orc')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('parquet')
(Figure: Schemas SD Rank pivoting Partition formats for Parquet Storage)
To show the distribution of the query runtimes, we use a box plot to compare the data between the queries in our experiment. A box plot provides information at a glance, giving general information about the data.
This method returns a box plot diagram that shows both the maximum and minimum runtimes of each individual query for a particular single-dimensional ranking.
Parameters:
q : list
This method can take a list of column names that the user specifies for plotting, to get a more precise box plot diagram. i.e. the list ["Q1", "Q2", "Q3"] will output a box plot for those queries only.
from Rank import SDRank
# Box plot example of all queries of schema ranking dimension
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotBox()
# Box plot example of queries 1, 2, and 3 of the schema ranking dimension
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotBox(["Q1", "Q2", "Q3"])
With the presence of the trade-offs introduced by the single-dimensional ranking functions, we propose an optimization technique that finds the non-dominated solutions (configuration combinations) by optimizing all dimensions at the same time, utilizing the NSGA-II algorithm.
In this experiment, we provide two ways to use the NSGA-II algorithm:
Parameters:
config_path : str
Specify the path to your configuration file(s). i.e. ./UIModule/settings_watdiv.yaml
log_path : str
Specify the path to your log file(s). i.e. ./UI Module/log_watdiv
ranking_sets : str
Ranking sets of user choice. i.e. dataset sizes (100M)
from Rank import MDRank
# example of MDRank class with 100M dataset size as ranking set of the experiment
multiDimensionRank_100M = MDRank(config_watdiv, logs_watdiv, '100M')
multiDimensionRank_250M = MDRank(config_watdiv, logs_watdiv, '250M')
multiDimensionRank_500M = MDRank(config_watdiv, logs_watdiv, '500M')
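The underlying notion is Pareto dominance: one configuration dominates another if it is at least as good in every dimension and strictly better in at least one. A minimal sketch of extracting a non-dominated set over hypothetical objective values (an illustration of the concept only, not PAPyA's NSGA-II implementation):

```python
def dominates(a, b):
    """True if a dominates b when minimizing every objective (e.g. runtimes)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical per-configuration objective vectors (lower is better).
runtimes = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5), (3.0, 3.0)]
print(pareto_front(runtimes))  # [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
```

NSGA-II goes further by sorting all solutions into successive fronts, which is how paretoQ and paretoAgg below also report the dominated configurations.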
This method returns a table of configuration solutions, as well as the configurations they dominate, according to the Non-dominated Sorting Genetic Algorithm II (NSGA-II) applied to minimizing the query runtimes of the ranked dimensions.
multiDimensionRank_100M.paretoQ().head()
| | Solution | Dominated |
|---|---|---|
| 0 | pt.horizontal.parquet | st.horizontal.orc |
| 1 | pt.subject.parquet | st.horizontal.parquet |
| 2 | pt.horizontal.orc | pt.predicate.parquet |
| 3 | pt.subject.orc | pt.predicate.orc |
| 4 | st.subject.orc | |
multiDimensionRank_250M.paretoQ().head()
| | Solution | Dominated |
|---|---|---|
| 0 | pt.horizontal.parquet | st.subject.parquet |
| 1 | pt.subject.orc | st.horizontal.orc |
| 2 | pt.subject.parquet | st.horizontal.parquet |
| 3 | pt.horizontal.orc | pt.predicate.orc |
| 4 | vp.horizontal.parquet | pt.predicate.parquet |
multiDimensionRank_500M.paretoQ().head()
| | Solution | Dominated |
|---|---|---|
| 0 | pt.subject.orc | st.predicate.orc |
| 1 | pt.subject.parquet | vp.predicate.parquet |
| 2 | pt.horizontal.orc | st.predicate.parquet |
| 3 | pt.horizontal.parquet | st.horizontal.orc |
| 4 | vp.subject.orc | st.subject.parquet |
This method returns a table of configuration solutions, as well as the configurations they dominate, according to the Non-dominated Sorting Genetic Algorithm II (NSGA-II) applied on the single-dimensional ranking criteria, maximizing the performance of all ranking sets together.
multiDimensionRank_100M.paretoAgg().head()
| | Solution | Dominated |
|---|---|---|
| 0 | st.predicate.orc | st.horizontal.parquet |
| 1 | vp.predicate.parquet | pt.predicate.orc |
| 2 | pt.horizontal.orc | pt.predicate.parquet |
| 3 | pt.subject.parquet | vp.horizontal.orc |
| 4 | vp.predicate.orc | vp.subject.parquet |
multiDimensionRank_250M.paretoAgg().head()
| | Solution | Dominated |
|---|---|---|
| 0 | st.subject.orc | st.horizontal.parquet |
| 1 | pt.horizontal.parquet | pt.predicate.parquet |
| 2 | pt.predicate.orc | |
| 3 | st.subject.parquet | |
| 4 | st.horizontal.orc | |
multiDimensionRank_500M.paretoAgg().head()
| | Solution | Dominated |
|---|---|---|
| 0 | vp.predicate.orc | st.horizontal.parquet |
| 1 | st.subject.orc | pt.predicate.parquet |
| 2 | st.predicate.parquet | st.subject.parquet |
| 3 | vp.subject.orc | pt.predicate.orc |
| 4 | pt.subject.parquet | vp.predicate.parquet |
This method returns a plot of the multi-dimensional ranking solutions according to paretoAgg, shown as shades of green projected into a three-dimensional space.
multiDimensionRank_100M.plot()
# multiDimensionRank_250M.plot()
multiDimensionRank_500M.plot()
This library provides two metrics for evaluating the goodness of a ranking criterion: conformance and coherence.
Consider a ranking $R_{s}$ whose top-3 ranked configurations are $\{c_{1},c_{2},c_{3}\}$, and suppose this set overlaps in only one element with the bottom-3 ranked configurations across the queries $|Q|$, say $\{c_{4},c_{2},c_{5}\}$: for example, $c_{2}$ sits in the $59^{th}$ position out of 60 positions.
Thus, $A(R^k) = 1 - \frac{1}{11 \times 3} \approx 0.97$, when $k = 3$ and $|Q| = 11$.
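The arithmetic of this example can be checked directly. A sketch of the conformance score as described above, where `overlaps` counts how often a top-k configuration appears among the non-conforming bottom positions:

```python
def conformance(overlaps, k, num_queries):
    """Conformance A(R^k) = 1 - overlaps / (k * |Q|), where overlaps is the
    number of times a top-k configuration lands in the bottom positions."""
    return 1 - overlaps / (k * num_queries)

# The example above: k = 3, |Q| = 11, and a single overlap (c2).
print(round(conformance(1, 3, 11), 4))  # 0.9697
```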
In this experiment, we assume that the rank sets are the dataset sizes (i.e. 100M and 250M). Kendall's distance counts the disagreements between two rank sets $R_{1}$ and $R_{2}$, normalized by $|P|$, the set of unique pairs of distinct elements in the two sets. For instance, the $K$ index between $R_{1}=\{c_{1},c_{2},c_{3}\}$ and $R_{2}=\{c_{1},c_{2},c_{4}\}$ for 100M and 250M is 0.33, i.e., one disagreement out of three pair comparisons.
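For two complete rankings of the same configurations, the normalized Kendall distance can be sketched as follows (plain Python; note that the example above compares top-k lists with differing elements, which requires a generalized variant not shown here):

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Normalized Kendall distance between two rankings of the same items:
    the fraction of item pairs whose relative order the rankings disagree on."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    disagreements = sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0 for a, b in pairs
    )
    return disagreements / len(pairs)

# One of the three pairs (c1, c2) is flipped between the two rankings.
print(kendall_distance(["c1", "c2", "c3"], ["c2", "c1", "c3"]))  # 0.333...
```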
Parameters:
config_path : str
Specify the path to your configuration file(s). i.e. ./UIModule/settings_watdiv.yaml
log_path : str
Specify the path to your log file(s). i.e. ./UI Module/log_watdiv
ranking_sets : str
Ranking sets of user choice. i.e. dataset sizes (100M)
conformance_set : list
List of ranking criteria to compute the conformance scores for.
k : int
Value of the top-k subset of the ranking criteria. i.e. _k_ = 5 takes the top 5 entries of each ranking criterion in the conformance set
h : int
Threshold counted against for the conformance score. i.e. _h_ = 28 takes queries that have a ranking below 28 out of all the configurations
from Ranker import Conformance
conformance_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
conf = Conformance(config_watdiv, logs_watdiv, '250M', conformance_set, 2, 9)
This method returns a table of conformance scores for each ranking criterion that the user specifies in the conformance_set, given the k and h values of the formula.
# conformance scores for all ranking criteria in the 250M dataset size
conf.run()
| | 250M |
|---|---|
| schemas | 0.950 |
| partition | 0.675 |
| storage | 0.275 |
| paretoQ | 0.975 |
| paretoAgg | 0.775 |
This method returns the criteria table of a chosen ranking dimension; the criteria table shows the rank values of each query for the top-k configurations.
Parameters:
dimension : str
the user's chosen dimension to view the criteria table
mode : 0 (default) or 1
'0' shows the criteria table with the raw rank values
'1' shows the criteria table as True/False values (True if the rank value is within the threshold h)
conf.configurationQueryRanks(dimension = 'paretoAgg', mode = 1)
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pt.horizontal.parquet | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| st.subject.orc | True | True | False | False | True | False | True | False | True | True | True | True | True | False | False | False | False | False | False | False |
Parameters:
config_path : str
Specify the path to your configuration file(s). i.e. ./UIModule/settings_watdiv.yaml
log_path : str
Specify the path to your log file(s). i.e. ./UI Module/log_watdiv
coherence_set : list
List of ranking criteria to compute Kendall's index scores for between ranking sets (i.e. dataset sizes).
from Ranker import Coherence
coherence_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
coh = Coherence(config_watdiv, logs_watdiv, coherence_set)
This method returns a table of coherence scores for each ranking criterion that the user specifies in the coherence_set, calculated from the number of (dis)agreements between two ranking sets.
Parameters:
rankset1 : str
a string of the _first_ rankset that user wants to compare
rankset2 : str
a string of the _second_ rankset that user wants to compare
# example of coherence scores for all ranking criteria, comparing the 100M ranking set with the 500M ranking set
coh.run('100M', '500M')
| | Kendall's Index |
|---|---|
| schemas | 0.333333 |
| partition | 0.483660 |
| storage | 0.614379 |
| paretoQ | 0.637363 |
| paretoAgg | 0.533333 |
This method returns a heat map table that shows the coherence between two particular ranking sets of the user's choice; the heat map is sorted by the best-performing configurations of the first ranking set.
Parameters:
rankset1 : str
a string of the _first_ rankset that user wants to compare
rankset2 : str
a string of the _second_ rankset that user wants to compare
dimension : str
user's choosen dimension to view the heat map
coh.heatMap('100M', '250M', dimension='paretoQ')
This method shows the coherence differences between two or more ranking sets that the user chooses, with the first ranking set as the pivot point (100M-250M and 100M-500M in the example below). The heat map is sorted by the best-performing configurations of the first ranking set.
Parameters:
*args : str
takes an arbitrary number of ranking sets, the first ranking set will be the pivot point for all the other ranking sets
dimension : str
user's choosen dimension to view the heat map
coh.heatMapSubtract('100M', '250M', '500M', dimension='paretoAgg')